Journal of Kunming Metallurgy College ›› 2015, Vol. 31 ›› Issue (5): 65-69. DOI: 10.3969/j.issn.1009-0479.2015.05.012

• Electronic Information Technology •

Using the Python-Based Natural Language Processing Toolkit in Corpus Research

LIU Xu

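  1. Department of Foreign Languages, Yunnan Normal University, Kunming 650500, Yunnan, China
  • Received: 2015-09-01 Online: 2015-11-30 Published: 2015-11-30
  • About the author: LIU Xu (b. 1989), male, from Zhaotong, Yunnan; master's student in subject teaching (English). His main research interests are foreign language teaching and corpus linguistics.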

The Application of NLTK Toolkit Based on Python in Corpus Research

LIU Xu   

  1. Department of Foreign Languages, Yunnan Normal University, Kunming 650500, China
  • Received: 2015-09-01 Online: 2015-11-30 Published: 2015-11-30

Abstract:

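Corpus-based research in China currently relies mainly on tools such as AntConc and PowerGREP; few studies use the NLTK package for Python to process and analyze data. Constrained by their own design, these tools cannot flexibly support different research methods. Using Python's NLTK package gives the data a uniform standard and avoids the trouble of converting between text-processing formats; it also makes up for the shortcomings of tools such as Range in syntactic analysis, plotting, and regular-expression search. Focusing on the main steps of corpus research (text tokenization, lemmatization, and text search and statistics), this paper briefly introduces the use of Python's NLTK natural language processing package in corpus research and, taking Jane Austen's novel Emma from the Gutenberg corpus as an example, shows how to use the package to process corpus data.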

Key words: Python, NLTK toolkit, corpus research

Abstract:

In current corpus-based research in China, AntConc and PowerGREP are the main research tools, and few studies use the Python NLTK package for data processing and analysis. Because of limitations in their design, such tools cannot flexibly support research methods. Using the Python NLTK package in a study gives the data a uniform standard and avoids the trouble of converting between word-processing formats. It also makes up for the weaknesses of tools such as Range in syntactic analysis, graph plotting, and regular expression search. This paper briefly introduces the application of the Python-based NLTK package to the main stages of corpus research: tokenization, lemmatization, and text search and statistics. It then takes Jane Austen's novel Emma from the Gutenberg corpus as an example to explain how to use the package to process corpus data.

Key words: Python, NLTK toolkit, corpus research
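
The stages named in the abstract (tokenization, lemmatization, and text search and statistics) can be illustrated with a minimal, library-free sketch. It is not the paper's code: it uses only the Python standard library so that it runs without downloads, and it works on the opening sentence of Emma rather than the full novel. The names `SAMPLE`, `LEMMAS`, `tokenize`, `lemmatize`, and `concordance` are assumptions made for illustration. In NLTK itself the corresponding tools are, roughly, `nltk.word_tokenize`, `nltk.stem.WordNetLemmatizer`, `nltk.FreqDist`, and `nltk.Text.concordance`, and the novel can be loaded with `nltk.corpus.gutenberg.words('austen-emma.txt')` once the corpus data has been downloaded.

```python
import re
from collections import Counter

# Opening sentence of Emma, a tiny stand-in for the full novel.
SAMPLE = (
    "Emma Woodhouse, handsome, clever, and rich, with a comfortable home "
    "and happy disposition, seemed to unite some of the best blessings of "
    "existence; and had lived nearly twenty-one years in the world with "
    "very little to distress or vex her."
)

# Hypothetical hand-written lemma table (an assumption for this sketch);
# a real lemmatizer consults a lexicon such as WordNet.
LEMMAS = {
    "seemed": "seem",
    "had": "have",
    "lived": "live",
    "blessings": "blessing",
    "years": "year",
}


def tokenize(text):
    """Split text into lowercase word tokens; hyphenated words stay whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def lemmatize(tokens):
    """Map each token to its base form when the table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


tokens = tokenize(SAMPLE)          # step 1: tokenization
lemmas = lemmatize(tokens)         # step 2: lemmatization (toy lookup)
freq = Counter(lemmas)             # step 3a: frequency statistics

print(len(tokens), "tokens,", len(freq), "distinct lemmas")
print(freq.most_common(3))
for line in concordance(tokens, "and"):   # step 3b: keyword search
    print(line)
```

Swapping the toy lookup table for a lexicon-based lemmatizer and the sample sentence for the full corpus text leaves the overall shape of the pipeline unchanged.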

CLC number: